%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% !TEX root = ../sutton_learning_1988.tex
\chapter{Introduction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{center}
  \begin{minipage}{0.5\textwidth}
    \begin{small}
      In which the problem of learning to predict is introduced,
      temporal-difference methods are motivated, and the plan of the
      paper is laid out.
    \end{small}
  \end{minipage}
  \vspace{0.5cm}
\end{center}

This article concerns the problem of learning to predict,
that is,
of using past experience with an incompletely known system to
predict its future behavior.
For example,
through experience one might learn to
predict for particular chess positions whether they will lead to a win,
for particular cloud formations whether there will be rain,
or for particular economic conditions how much the stock market will rise or fall.
Learning to predict is one of the most basic and prevalent kinds of learning.
Most pattern recognition problems,
for example,
can be treated as prediction problems in which the classifier must predict the correct classifications.
Learning-to-predict problems also arise in heuristic search,
e.g.,
in learning an evaluation function that predicts the utility of searching particular parts of the search space,
or in learning the underlying model of a problem domain.
An important advantage of prediction learning is that its training examples can be taken directly from the temporal sequence of ordinary sensory input;
no special supervisor or teacher is required.

In this article,
we introduce and provide the first formal results
in the theory of \gls{td} methods,
a class of incremental learning procedures specialized for prediction problems.
Whereas conventional prediction-learning methods are driven by the error between predicted and actual outcomes,
TD methods are instead driven by the error or difference between temporally successive predictions;
with them,
learning occurs whenever there is a change in prediction over time.
For example,
suppose a weatherman attempts to predict on each day of the week whether it will rain on the following Saturday.
The conventional approach is to compare each prediction to the actual outcome---whether or not it does rain on Saturday.
A TD approach,
on the other hand,
is to compare each day's prediction with that made on the following day.
If a 50\% chance of rain is predicted on Monday,
and a 75\% chance on Tuesday,
then a TD method increases predictions for days similar to Monday,
whereas a conventional method might either increase or decrease them depending on Saturday's actual outcome.
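The contrast can be sketched numerically. As an illustrative sketch only (the scalar predictions, learning rate, and outcome value below are assumptions, not from the text), the two update rules differ in what error signal drives them:

```python
# Hypothetical sketch: conventional vs. TD corrections to Monday's
# prediction in the rain-on-Saturday example. P[t] is the predicted
# probability of rain; alpha is an illustrative learning rate.

alpha = 0.1
P = [0.50, 0.75]       # Monday's and Tuesday's predictions
outcome = 1.0          # it did rain on Saturday (known only later)

# Conventional (supervised) update: wait for the actual outcome, then
# move Monday's prediction toward what happened.
conventional_delta = alpha * (outcome - P[0])

# TD update: available already on Tuesday; move Monday's prediction
# toward Tuesday's prediction, without knowing the outcome.
td_delta = alpha * (P[1] - P[0])

print(conventional_delta, td_delta)
```

Note that the TD correction here is positive regardless of Saturday's weather, whereas the conventional correction could have either sign once the outcome is known.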

We will show that TD methods have two kinds of advantages over conventional prediction-learning methods.
First,
they are more incremental and therefore easier to compute.
For example,
the TD method for predicting Saturday's weather can update each day's prediction on the following day,
whereas the conventional method must wait until Saturday,
and then make the changes for all days of the week.
The conventional method would have to do more computing at one time than the TD method and would require more storage during the week.
The second advantage of TD methods is that they tend to make more efficient use of their experience:
they converge faster and produce better predictions.
We argue that the predictions of TD methods are both more accurate and easier to compute than those of conventional methods.
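The incremental character can be made concrete. Under the same illustrative assumptions as before (scalar daily predictions and a learning rate alpha, all values hypothetical), the day-by-day TD corrections to Monday's prediction telescope to the single batch correction the conventional method computes on Saturday, but each TD piece can be computed and applied as soon as the next day's prediction is available:

```python
# Hypothetical sketch: the week's TD corrections to Monday's prediction
# sum (telescope) to the one conventional correction, yet each piece is
# available a day at a time, with no need to store the whole week.

alpha = 0.1
predictions = [0.5, 0.75, 0.6, 0.8, 0.9]   # Monday..Friday, illustrative
outcome = 1.0                              # rain on Saturday

# Conventional: one batch correction, computable only on Saturday.
batch = alpha * (outcome - predictions[0])

# TD: a small correction each day, from successive prediction differences;
# the outcome is treated as the final "prediction" in the sequence.
sequence = predictions + [outcome]
incremental = sum(alpha * (sequence[t + 1] - sequence[t])
                  for t in range(len(sequence) - 1))

assert abs(batch - incremental) < 1e-12    # the differences telescope
```

The equality holds because the intermediate predictions cancel in the sum; what the TD method spreads over the week, the conventional method must compute, and store predictions for, all at once.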

The earliest and best-known use of a TD method was in Samuel's (1959) celebrated checker-playing program.
For each pair of successive game positions,
the program used the difference between the evaluations assigned to the two positions to modify the earlier one's evaluation.
Similar methods have also been used in
Holland's bucket brigade \citep{holland_escaping_1986},
in the author's Adaptive Heuristic Critic
\citep{sutton_temporal_1984, barto_neuronlike_1983}
and in learning systems studied by
\citet{witten_adaptive_1977},
\citet{booker_intelligent_1982},
and
\citet{hampson_neural_1983}.
TD methods have also been proposed as models of classical conditioning
\citep{
    sutton_toward_1981,
    sutton_temporal-difference_1987,
    gelperin_logic_1985,
    moore_simulation_1986,
    klopf_neuronal_1987%
}.



Nevertheless,
TD methods have remained poorly understood.
Although they have performed well,
there has been no theoretical understanding of how or why they work.
One reason is that they were never studied independently,
but only as parts of larger and more complex systems.
Within these systems,
TD methods were used to improve evaluation functions by better predicting goal-related events
such as rewards, penalties, or checker game outcomes.
Here we advocate viewing TD methods in a simpler way---as methods for efficiently learning to predict arbitrary events,
not just goal-related ones.
This simplification allows us to evaluate them in isolation and has enabled us to obtain formal results.
In this paper,
we prove the convergence and optimality of TD methods for important special cases,
and we formally relate them to conventional \gls{sl} procedures.

Another simplification we make in this paper is to focus on numerical prediction processes rather than on rule-based or symbolic prediction
\citep[e.g.,][]{dietterich_learning_1986}.
The approach taken here is much like that used in connectionism and in Samuel's original work---our predictions are based on numerical features combined using adjustable parameters or ``weights''.
This and other representational assumptions are detailed in Section 2.
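As a minimal sketch of this representational assumption (the particular feature and weight values are invented for illustration), a prediction is a linear combination of numerical features with adjustable weights:

```python
# Hypothetical sketch: predictions formed from numerical features
# combined via adjustable parameters ("weights"), as assumed in this
# paper; learning then amounts to adjusting w.

def predict(w, x):
    """Linear prediction: the dot product of weights w and features x."""
    return sum(wi * xi for wi, xi in zip(w, x))

w = [0.2, -0.1, 0.5]    # adjustable weights (illustrative values)
x = [1.0, 2.0, 0.5]     # numerical features of one observation
print(predict(w, x))    # 0.2 - 0.2 + 0.25 = 0.25
```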

Given the current interest in learning procedures for multi-layer connectionist networks
\citep[e.g.,][]{
    rumelhart_learning_1985,
    ackley_learning_1985,
    barto_learning_1985,
    anderson_learning_1986,
    williams_reinforcement_1986,
    hampson_disjunctive_1987%
},
we note that here we are concerned with a different set of issues.
The work with multi-layer networks focuses on learning input-output mappings of more complex functional forms.
Most of that work remains within the supervised-learning paradigm,
whereas here we are interested in extending and going beyond it.
We consider mostly mappings of very simple functional forms,
because the differences between supervised learning methods and TD methods are clearest in these cases.
Nevertheless,
the TD methods presented here can be directly extended to multi-layer networks (see Section 6.2).

The next section introduces a specific class of temporal-difference procedures by contrasting them with conventional,
supervised-learning approaches,
focusing on computational issues.
Section 3 develops an extended example that illustrates the potential performance advantages of TD methods.
Section 4 contains the convergence and optimality theorems and discusses TD methods as gradient descent.
Section 5 discusses how to extend TD procedures,
and Section 6 relates them to other research.

